tmp <- 10
tmp1 <- tmp * 24
2024-05-28
In general:
Covering this session’s topics
✏️ Easy to write
familiarity with code editor, libraries
💡 Easy to understand
structured, with consistent variable names, commented.
🔧 Easy to debug
clear naming, DRY, tests.
🏎️ Easy to run
(profiling, C++, using “optimized” code).
Names should be consistent, descriptive, lower case, and readable.
For which snippet is it easier to guess the context?
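As a small illustration of the naming question, the sketch below sets a vague snippet next to a descriptive one. The context (converting days to hours) and the names `n_days` and `n_hours` are assumptions made up for this example.

```r
# Vague names: the reader cannot tell what is being computed
tmp <- 10
tmp1 <- tmp * 24

# Descriptive names (hypothetical context: converting days to hours)
n_days <- 10
n_hours <- n_days * 24
```

Both snippets compute the same value; only the second tells the reader what the value means.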
Functions are first-class citizens in R
Rethink for, while loops; “apply” instead
“To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs.”
— Bjarne Stroustrup
…but why?
Say you want to extract the \(R^2\) from three linear models with different predictors (or formulae).
What’s the difference?
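A minimal sketch of the comparison, assuming the built-in mtcars data set and three formulae chosen only for illustration. The for loop needs a pre-allocated result vector and an index, whereas the apply-style call states the intent in one expression.

```r
# Three hypothetical formulae on the built-in mtcars data set
formulae <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + qsec)

# for loop: pre-allocate the result, then fill it element by element
r2_loop <- numeric(length(formulae))
for (i in seq_along(formulae)) {
  r2_loop[i] <- summary(lm(formulae[[i]], data = mtcars))$r.squared
}

# apply-style: one function, applied to each formula
get_r2 <- function(f) summary(lm(f, data = mtcars))$r.squared
r2_apply <- vapply(formulae, get_r2, numeric(1))
```

Both approaches return the same three values; `vapply()` additionally checks that each result is a single number.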
parallel package
You can imagine wanting to run each of the apply/for loop iterations in parallel.
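One possible way to do this with the parallel package is sketched below. It assumes two local worker processes are available, reuses the built-in mtcars data set, and uses a hypothetical helper `fit_r2()`. The data is passed explicitly so the workers do not depend on anything in the main session.

```r
library(parallel)

# Hypothetical formulae and helper, for illustration only
formulae <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + qsec)
fit_r2 <- function(f, data) summary(stats::lm(f, data = data))$r.squared

# Start two worker processes (assumption: at least two cores are free)
cl <- makeCluster(2)
# parLapply() is the parallel counterpart of lapply()
r2_par <- unlist(parLapply(cl, formulae, fit_r2, data = mtcars))
# Always release the workers when done
stopCluster(cl)
```

For three small models the overhead of starting workers will likely outweigh the gain; the pattern pays off when each iteration is expensive.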
data.table is a package that extends the data.frame class.
… dplyr for large datasets.
dt[i, j, by]
dt …
iris summaries
iris summaries (.VARS)
… .N)
… .SD)
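The `dt[i, j, by]` form and the special symbols `.N` and `.SD` can be sketched on the iris data as follows. The object names and the chosen filter are assumptions for illustration, not the original slide code.

```r
library(data.table)
iris_dt <- as.data.table(iris)

# i filters rows, j computes, by groups
mean_by_species <- iris_dt[Sepal.Length > 5,
                           .(mean_petal = mean(Petal.Length)),
                           by = Species]

# .N is the number of rows in each group
n_by_species <- iris_dt[, .N, by = Species]

# .SD is the subset of data for each group; .SDcols selects its columns
numeric_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
means_by_species <- iris_dt[, lapply(.SD, mean),
                            by = Species, .SDcols = numeric_cols]
```

Each call returns a new data.table with one row per species; `iris_dt` itself is not modified.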
dt (in place)
You will need a walrus (operator):
:=
name := vector to act on a single column.
(names) := list of vectors to act on multiple columns.
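A short sketch of both forms of `:=`, assuming a fresh copy of iris. The column names `Sepal.Area`, `Sepal.Max` and `Sepal.Min` are hypothetical and chosen only for this example.

```r
library(data.table)
iris_dt <- as.data.table(iris)

# Single column: name := vector
iris_dt[, Sepal.Area := Sepal.Length * Sepal.Width]

# Multiple columns: (names) := list of vectors
extra_cols <- c("Sepal.Max", "Sepal.Min")
iris_dt[, (extra_cols) := list(
  pmax(Sepal.Length, Sepal.Width),
  pmin(Sepal.Length, Sepal.Width)
)]
```

No assignment with `<-` is needed: `:=` modifies `iris_dt` by reference, so the new columns are available immediately.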
Add integer versions of the numeric columns, named with an _int suffix.
# numeric_cols is assumed to hold the names of the numeric iris columns
new_cols <- paste0(numeric_cols, "_int")
iris_dt[, (new_cols) := lapply(
  .SD, as.integer
), .SDcols = numeric_cols]
head(iris_dt, n = 2)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_ita
1:          5.1         3.5          1.4         0.2  setosa    setosino
2:          4.9         3.0          1.4         0.2  setosa    setosino
   Sepal.Length_int Sepal.Width_int Petal.Length_int Petal.Width_int
1:                5               3                1               0
2:                4               3                1               0
data.table is a powerful package for data manipulation.
Advanced topics:
The profvis package is a good choice for this purpose.
library(profvis)
library(data.table)
# Simulate a wide data set: 400,000 rows, 150 numeric columns and an id column
n <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(n * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(n)), data)
data_dt <- as.data.table(data)
numeric_vars <- setdiff(names(data), "id")
# Profile six ways of computing the column means
profvis({
  means <- apply(data[, names(data) != "id"], 2, mean)          # apply over columns
  means <- colMeans(data[, names(data) != "id"])                # base R colMeans
  means <- lapply(data[, names(data) != "id"], mean)            # returns a list
  means <- vapply(data[, names(data) != "id"], mean, numeric(1)) # type-checked vector
  means <- matrixStats::colMeans2(as.matrix(data[, names(data) != "id"]))
  means <- data_dt[, lapply(.SD, mean), .SDcols = numeric_vars] # data.table
})
Why are code reproducibility & generalisability important?
… sessionInfo())
… kmeans_recip that does the following:
… kmeans) on the resulting data set.
… iris data set (with K=3 clusters).
… iris)
that's looking good
kmeans_recip <- function(data){
  # Obtain numerical variables
  cont_cols <- which(sapply(data, is.numeric))
  # Check there is at least 1 numerical variable
  if (length(cont_cols)==0) stop('No numerical variables!')
  for (i in cont_cols){
    # Check if numerical variable takes 0 value
    ifelse(any(data[, i] == 0),
           stop('Division by 0 not allowed!'),
           data[, i] <- 1/data[, i])
  }
  # Apply K-Means clustering
  kmeans_res <- kmeans(data[, cont_cols], centers = 3)
  return(kmeans_res$cluster)
}
kmeans_recip(data = iris)[1:20]
 [1] 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 3 3 3 3 3
… diamonds data set from the ggplot2 package.
… cat(), print(), message() etc.)
traceback(), browser(), debug()
tryCatch() syntax
tryCatch() is the function to use for error handling in R.
tryCatch example
safe_log <- function(x){
  result <- tryCatch({
    log(x) # Attempt to calculate the logarithm
  },
  warning = function(w){
    message("A warning occurred: ", w) # Handle warnings
    NULL # Return NULL if a warning occurs
  },
  error = function(e){
    message("An error occurred: ", e) # Handle the error
    NA # Return NA if an error occurs
  },
  finally = {
    # This block executes no matter what
    message("Logarithm attempt completed.")
  })
  return(result)
}
Logarithm attempt completed.
[1] 7.612831
A warning occurred: simpleWarning in log(x): NaNs produced
Logarithm attempt completed.
NULL
An error occurred: Error in log(x): non-numeric argument to mathematical function
Logarithm attempt completed.
[1] NA
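The three outputs above correspond to a successful call, a warning, and an error. The sketch below applies the same handler pattern to three inputs in one go. The helper name `quiet_log` and the inputs are assumptions for illustration; the calls that produced the output above are not shown in the source.

```r
# Compact variant of the handler pattern (quiet_log is a hypothetical name)
quiet_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      # conditionMessage() extracts just the text of the condition
      message("A warning occurred: ", conditionMessage(w))
      NULL
    },
    error = function(e) {
      message("An error occurred: ", conditionMessage(e))
      NA
    }
  )
}

# Assumed inputs: a valid number, a negative number, a character string
inputs <- list(valid = exp(1), negative = -1, text = "a")
results <- lapply(inputs, quiet_log)
```

Because every failure is caught inside the function, `lapply()` runs to completion instead of stopping at the first bad input.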